Marketing Analytics Process

Inferential Modeling Workflow

Linear Models

Last time we translated a story into a statistical model. In general, a model looks like this:

\[y = \beta_0 + \beta_1 x + \epsilon\]

Because \(y = \beta_0 + \beta_1 x\) is a linear equation, adding \(\epsilon\) makes this a linear model.

  • \(y\) is the outcome variable (a.k.a., response or dependent variable).
  • \(x\) is an explanatory variable (a.k.a., a predictor or independent variable, feature, or covariate).
  • \(\beta_0\) is the intercept parameter.
  • \(\beta_1\) is a slope parameter.
  • \(\epsilon\) is the error term.

But what kind of error?

If the error term includes every explanatory variable beyond the \(x\) we’ve included in the model, what does the sum of all their effects look like?

We have to assume \(\epsilon\) is distributed a certain way, typically:

\[\epsilon \sim Normal(0, 1)\]

The normal distribution, and every other probability distribution, has a few essential components:

  • The support is the set of values on the x-axis that have non-zero probability.
  • The probability of any subset of the support is the corresponding area under the curve.
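As a quick sketch, R’s pnorm() computes these areas for the standard normal distribution:

```r
# The probability of a subset of the support is the area under the curve.
# pnorm() gives the area to the left of a value for the standard normal.
pnorm(1) - pnorm(-1)  # P(-1 < epsilon < 1), about 0.68
pnorm(Inf)            # the whole support has probability 1
```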

The normal distribution is one of many probability distributions. Here’s the uniform distribution.

  • What is the support?
  • What value in the support is most likely?
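As a sketch of the answers for a Uniform(0, 1): dunif() and punif() show that the support is [0, 1] and that no value in it is more likely than any other:

```r
# The density is the same (1) everywhere in the support [0, 1]...
dunif(0.25)
dunif(0.75)
# ...and zero outside it.
dunif(2)
# The area under the flat curve up to 0.5 is half the total area.
punif(0.5)
```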

Okay, but why should we use a linear model and assume \(\epsilon \sim Normal(0, 1)\)?

  • A linear model is simple but a good assumption when we don’t know a lot about how our data is created (the story). It says the outcome variable is the result of summing up the effects of the explanatory variable(s).
  • The normal distribution shows up a lot in nature and, like the linear model, is a good assumption when we don’t know a lot about the story. It says the effects of all the variables we don’t include in the model add up to something that looks like normal error.

A linear model with normal error and a continuous outcome \(y\) is known as a regression.

Simulating Data

We have now outlined a complete model:

\[y = \beta_0 + \beta_1 x + \epsilon, \text{ where } \epsilon \sim Normal(0, 1)\]

For illustration, let’s assume that our model captures the full story, choose values for the parameters, and generate or simulate data using the model.

How do we simulate data from this?

\[y = \beta_0 + \beta_1 x + \epsilon, \text{ where } \epsilon \sim Normal(0, 1)\]

  • Choose values for \(\beta_0\) and \(\beta_1\).
  • Obtain random values for \(x\).
  • Obtain random values for \(\epsilon\) (the code below uses a standard deviation of 3 rather than 1, for noisier data).
  • Add them together!
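Written out step by step in base R (a sketch using \(\epsilon \sim Normal(0, 1)\); the tidyverse version we’ll actually use follows below):

```r
# 1. Choose values for the parameters.
beta0 <- 3
beta1 <- 7

# 2. Obtain random values for x.
set.seed(42)
x <- runif(10, min = 0, max = 7)

# 3. Obtain random values for epsilon.
epsilon <- rnorm(10, mean = 0, sd = 1)

# 4. Add them together!
y <- beta0 + beta1 * x + epsilon
```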

Whenever we are using randomization, for example, when simulating data or running a model that relies on randomization, we want to use set.seed() so we can get the same results every time we render.

# Random numbers are, well, random.
rnorm(1)
## [1] -1.431754
rnorm(1)
## [1] -0.6508611
rnorm(1)
## [1] -1.824262

# But using set.seed() we at least start at the *same* random number each time.
set.seed(42)
rnorm(1)
## [1] 1.370958
set.seed(42)
rnorm(1)
## [1] 1.370958
set.seed(42)
rnorm(1)
## [1] 1.370958

# Load packages.
library(tidyverse)

# Set the randomization seed to simulate the same data each time.
set.seed(42)

# Set the parameter values.
beta0 <- 3
beta1 <- 7

# Simulate data.
sim_data <- tibble(
  x = runif(100, min = 0, max = 7),
  y = beta0 + beta1 * x + rnorm(100, mean = 0, sd = 3)
)

Plotting Our Simulated Data

sim_data |> 
  ggplot(aes(x = x, y = y)) +
  geom_point()

Parameter Estimates

Okay, what have we talked about so far?

  • We use inferential models to understand a process we don’t observe.
  • We translate a story about our data into a statistical model.
  • If we have a continuous outcome \(y\), a regression is a good model to start with.

Our goal is to use the model to estimate the unobserved parameters from the data (i.e., make our best guess).

In other words, an inferential model extracts parameter estimates from the data to inform our managerial decision.

Estimating \(\beta_0\) and \(\beta_1\)

Finding the Best Line


Minimizing the Sum of Squared Residuals

The best line should be the one that makes the combined length of the vertical bars as small as possible.

The vertical bars are called residuals, and they represent the vertical distance between the observed data \(y\) and a particular line.

Residuals can be positive or negative, so we minimize the sum of the squared residuals.
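A minimal sketch of this idea on hypothetical toy data: the candidate line closer to the data has the smaller sum of squared residuals.

```r
# Toy data (hypothetical), generated to lie near the line y = 2 + 3x.
x <- c(1, 2, 3, 4, 5)
y <- c(5.1, 7.9, 11.2, 13.8, 17.1)

# Sum of squared residuals for a candidate line with intercept b0 and slope b1.
ssr <- function(b0, b1) sum((y - (b0 + b1 * x))^2)

ssr(2, 3)  # near the true line: small
ssr(0, 1)  # far from the data: much larger
```

The line of best fit is the one whose intercept and slope minimize this quantity over all possible candidates.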

Using tidymodels

The tidymodels framework is a collection of packages for modeling using tidyverse principles. The goal is to provide the same consistency and ease-of-use for modeling that the tidyverse provides for importing, wrangling, visualizing, and reporting data.

# Load packages.
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.3.0 ──
## ✔ broom        1.0.9     ✔ rsample      1.3.1
## ✔ dials        1.4.1     ✔ tune         2.0.0
## ✔ infer        1.0.9     ✔ workflows    1.3.0
## ✔ modeldata    1.5.1     ✔ workflowsets 1.1.1
## ✔ parsnip      1.3.3     ✔ yardstick    1.3.2
## ✔ recipes      1.3.1
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()

Wait…We Used the Tidyverse…Now We Have to Learn Tidymodels?

Yes! But you’ll notice that the way we write code stays remarkably consistent.

Specify the Model Type and Engine

In the tidymodels framework, we specify the model type and then set the engine we’d like to estimate the model with. The engine is typically another R package that would normally require its own unique syntax.

This allows us to use a variety of modeling packages while keeping the same syntax.

# Specify the model type and engine.
reg_model <- linear_reg() |> 
  set_engine("lm")

Fit the Model

When we fit a linear model (a.k.a., training, calibrating, or estimating the model) we are finding the line of best fit and getting parameter estimates.

To do this, the fit() function uses formula notation, where the model is specified as outcome ~ explanatory variable(s).

# Fit the model.
(reg_model <- reg_model |>
  fit(y ~ x, data = sim_data))
## parsnip model object
## 
## 
## Call:
## stats::lm(formula = y ~ x, data = data)
## 
## Coefficients:
## (Intercept)            x  
##       2.239        7.187

View Model Results

reg_model |>
  tidy()
## # A tibble: 2 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)     2.24     0.559      4.01 1.20e- 4
## 2 x               7.19     0.132     54.4  4.75e-75

Live Coding Exercise

Our client, Kroger, has requested that we walk them through the basics of linear regression using a hypothetical example. Let’s construct an example, define our story, translate it to a model, use that model to simulate data, and fit our model to the simulated data.

Wrapping Up

Summary

  • Discussed linear models.
  • Walked through parameter estimation for a linear model.
  • Introduced tidymodels to fit models.

Next Time

  • Evaluating model fit.

Supplementary Material

  • Tidy Modeling with R Chapter 6.1

Artwork by @allison_horst

Exercise 7

Kroger is specifically interested in whether their spend on promotions (Any_Promo_Spend) and discounts (Price_Decr_Spend) is driving their sales (Sales) in the soup category. Do the following:

  1. Import the soup data and load the tidyverse and tidymodels packages.
  2. Create a regression model and fit it to the appropriate variables.
  3. Examine the coefficients: do promotional spend and price decreases appear to have a positive or negative effect on sales?
  4. Render the Quarto document into Word and upload it to Canvas.